Vector Embedding Introduction¶

  • Thomas Fuchs, Data Scientist
  • Georg M. Sorst, Team Lead Search


Vector Embeddings¶

Specialized neural networks can transform text into vectors.

These vectors can be embedded into a common vector space.

This makes it possible to discover semantic relationships between texts.

Let's define some words.

In [3]:
words = [
    "queen",
    "king",
    "prince",
    "princess",
    "man",
    "woman",
    "boy",
    "girl",
    "red",
    "green",
    "blue",
    "palace",
]

Transforming words into vectors is easy with Python.

Many free models exist to perform vector embedding.

In [4]:
from sentence_transformers import SentenceTransformer

def embed(texts):
    model_name = "all-MiniLM-L6-v2"
    model = SentenceTransformer(
        model_name,
        device=helpers.get_torch_device_name(),  # optional: run on a GPU if available
    )
    return model.encode(texts)

The resulting vector is represented as an array of floating-point numbers in Python.

All vectors share the same dimensionality; for this model it is 384 dimensions.

In [5]:
pd.DataFrame(embed(words[0]))
Out[5]:
0
0 0.035487
1 -0.065605
2 -0.009935
3 0.031590
4 -0.013387
... ...
379 0.026038
380 0.091385
381 -0.053889
382 -0.031242
383 -0.086961

384 rows × 1 columns

Each word is transformed into a vector so that we can discover semantic relationships.

In [6]:
pd.DataFrame(
    {"Sentence": words, "Encoding": list(embed(words))}
).head(3)
Out[6]:
Sentence Encoding
0 queen [0.035486974, -0.06560464, -0.009934947, 0.031...
1 king [-0.05959935, 0.050512373, -0.0695101, 0.07968...
2 prince [-0.036828797, 0.041282006, 0.04185658, 0.0417...

Let's visualize the vectors to show their relationships.

But 384-dimensional vectors cannot be plotted on a 2-dimensional screen.

Principal Component Analysis (PCA) can reduce the number of dimensions from 384 to 2.

In [7]:
word_samples = words[0:3]
embeddings = embed(word_samples)
reduced_embeddings = PCA(n_components=2).fit_transform(embeddings)
pd.DataFrame(
    {"Words": word_samples, "Encoding": list(reduced_embeddings)}
)
Out[7]:
Words Encoding
0 queen [-0.2934046, -0.3892648]
1 king [-0.25320083, 0.40883183]
2 prince [0.54660565, -0.019567147]

When visualizing the reduced vectors, clear semantic clusters appear.

In [9]:
plot(words, embed(words))
(Figure: 2D scatter plot of the word embeddings, with semantically related words clustered together.)

Sentence Embeddings¶

We can embed not only single words but entire sentences and documents.

Let's define some documents and embed them.

In [10]:
import pandas as pd

documents = [
    "Vector embeddings are mathematical representations of objects, often words or phrases, in a high-dimensional space. By mapping similar objects to proximate points, embeddings capture relationships and semantic meaning. Commonly used in machine learning and natural language processing tasks, methods like Word2Vec, GloVe, and FastText have popularized their application, enabling advancements in text analysis, recommendation systems, and more.",
    "Keyword search refers to the process of locating information in a database, search engine, or other data repository by specifying particular words, phrases, or symbols. In the digital realm, it's foundational to search engines like Google and Bing. The search results are typically ranked based on relevance, which is determined using various algorithms that consider factors like frequency, location, and link structures. Keyword search is integral for navigating the vast expanse of online information, aiding users in retrieving relevant data efficiently.",
    "Sandwiches are a popular type of food consisting of one or more types of food, such as vegetables, sliced meat, or cheese, placed between slices of bread. They can range from simple combinations like peanut butter and jelly to more complex gourmet creations. Originating from England in the 18th century, sandwiches have become a staple in many cultures worldwide, prized for their convenience and versatility. Variations exist based on regional preferences, ingredients, and preparation methods.",
    "Data science is an interdisciplinary field that leverages statistical, computational, and domain-specific expertise to extract insights and knowledge from structured and unstructured data. It encompasses various techniques from statistics, machine learning, data mining, and big data technologies to analyze and interpret complex data. Data science has applications across numerous sectors, including healthcare, finance, marketing, and social sciences, driving decision-making, predictive analytics, and artificial intelligence advancements. Its growing significance in today's data-driven world has led to the rise of specialized tools, methodologies, and educational programs.",
    "Neural networks are a class of machine learning models inspired by the biological neural networks of animal brains. They consist of interconnected layers of nodes, or neurons, which process input data through a series of transformations and connections to produce output. Neural networks are particularly adept at recognizing patterns, making them useful for a wide range of applications such as image and speech recognition, natural language processing, and predictive analytics. The development of deep neural networks, which contain multiple hidden layers, has been central to the field of deep learning and has significantly advanced the capabilities of artificial intelligence systems.",
    "Pasta is a staple food of traditional Italian cuisine, with the first reference dating to 1154 in Sicily. It is typically made from an unleavened dough of durum wheat flour mixed with water or eggs and formed into sheets or various shapes, then cooked by boiling or baking. Pasta is versatile and can be served with a variety of sauces, meats, and vegetables. It is categorized in two basic styles: dried and fresh. Popular around the world, pasta dishes are central to many diets and come in numerous shapes like spaghetti, penne, and ravioli.",
    "Soup is a liquid food, generally served warm or hot (but also cold), that is made by combining ingredients such as meat and vegetables with stock, juice, water, or another liquid. Soups are inherently diverse, ranging from rich, cream-based varieties to brothy and vegetable-laden concoctions. They are often regarded as comfort food and can be served as a main dish or as an appetizer, with regional and cultural variations like the Spanish gazpacho, Japanese miso soup, and Russian borscht.",
    "A casserole is a comprehensive one-dish meal baked in a deep, ovenproof dish with a glass or ceramic base. It typically includes a combination of meats, vegetables, starches like rice or potatoes, and a binding agent like a soup or sauce. Topped with cheese or breadcrumbs for a crispy crust, casseroles are appreciated for their convenience and the ability to meld flavors during the baking process. They are a fixture in many cultures and are particularly beloved as home-cooked comfort foods, often featuring in communal gatherings and family dinners.",
]

pd.DataFrame(
    {"Sentence": documents, "Encoding": list(embed(documents))}
).head(3)
Out[10]:
Sentence Encoding
0 Vector embeddings are mathematical representat... [-0.0016682143, -0.069409996, -0.02650509, 0.0...
1 Keyword search refers to the process of locati... [0.019650524, -0.06271498, -0.045780774, -0.00...
2 Sandwiches are a popular type of food consisti... [-0.044322807, -0.023782454, 0.036511306, -0.0...

Again, semantic clusters appear when visualizing the vectors in 2D space.

In [12]:
plots([(documents, embed(documents), "green")])
(Figure: 2D scatter plot of the document embeddings, with related documents clustered together.)

Information Retrieval¶

Similar documents have similar vectors.

This characteristic can be used to retrieve related documents for an input text.

Let's start by defining some search queries.

In [13]:
queries = [
    "information retrieval",
    "machine learning",
    "cooking",
]
plots([(queries, embed(queries), "red")])
(Figure: 2D scatter plot of the query embeddings.)

Visualizing documents and queries in one space uncovers semantic relations.

Each query is closest to its most relevant documents.

In [14]:
plots(
    [
        (documents, embed(documents), "green"),
        (queries, embed(queries), "red"),
    ]
)
(Figure: documents in green and queries in red plotted in a shared 2D space.)
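The "closest document wins" idea above can be written out directly: rank documents by their cosine similarity to the query vector. A minimal sketch, using small toy vectors in place of the 384-dimensional model output (the result of `embed` would slot in unchanged):

```python
import numpy as np

def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return indices of the top_k documents most similar to the query."""
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    # Cosine similarity is the dot product of L2-normalized vectors.
    doc_norm = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = doc_norm @ query_norm
    # Highest similarity first.
    return np.argsort(scores)[::-1][:top_k]

# Toy 3-dimensional "embeddings" standing in for real document vectors.
docs = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 0.1, 1.0]]
query = [1.0, 0.0, 0.0]
print(rank_documents(query, docs, top_k=2))  # → [0 1]
```

The first two documents point in nearly the same direction as the query, so they are returned first; the third, nearly orthogonal one is not.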

Simply Search¶

Load Data¶

First, we need to load data; here we use product data from a customer.

In [16]:
df_products.head(3)
Out[16]:
productId name description brand category
0 1303906844777 Black KeepCup Small Gym+Coffee branded Black Keepcup. The easy cho... Gym+Coffee [Versatile Collection, Autumn Fits, All-In Col...
1 1303907074153 Black KeepCup Medium Gym+Coffee branded Black Keepcup. The easy cho... Gym+Coffee [Versatile Collection, Autumn Fits, Gifts Unde...
2 1316361076841 U-Move Tank The essential U-Move tanks were designed with ... Gym+Coffee [T-Shirts & Tanks, Versatile Collection, Tanks...

Embed Data¶

The next step is to embed the data.
In a real-world scenario, we use specialised search engines such as Elasticsearch to apply embeddings and assign weights to different fields.
The advantage of using embeddings is that different fields can be combined into a single text up front, so the resulting vector captures all of them.

In [17]:
combine_fields = (
    lambda x: f"Product name = {x['name']}\n"
    f"Description = {x['description']}\n"
    f"Categories = {x['category']}\n"
    f"Brand = {x['brand']}"
)
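To illustrate what `combine_fields` produces, here is the same lambda applied to a made-up row mimicking the `df_products` schema shown above (the sample data is hypothetical, not from the customer dataset):

```python
import pandas as pd

combine_fields = (
    lambda x: f"Product name = {x['name']}\n"
    f"Description = {x['description']}\n"
    f"Categories = {x['category']}\n"
    f"Brand = {x['brand']}"
)

# Hypothetical sample row in the shape of df_products.
sample = pd.DataFrame([{
    "name": "Black KeepCup Small",
    "description": "Gym+Coffee branded Black Keepcup.",
    "category": ["Versatile Collection"],
    "brand": "Gym+Coffee",
}])

# One combined base string per product, ready to be passed to embed().
base_strings = sample.apply(combine_fields, axis=1)
print(base_strings[0])
```

The resulting string matches the `base_string` column of `df_for_search` below: all fields are flattened into one text whose embedding reflects name, description, categories, and brand at once.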
In [19]:
df_for_search.head(5)
Out[19]:
base_string embeddings
0 Product name = Black KeepCup Small\nDescriptio... [-0.075529054, 0.019285059, -0.006989161, 0.08...
1 Product name = Black KeepCup Medium\nDescripti... [-0.064423464, 0.025618447, 0.0058485977, 0.07...
2 Product name = U-Move Tank\nDescription = The ... [-0.08300711, 0.060539, -0.020685865, 0.051009...
3 Product name = U-Stretch Tank\nDescription = B... [-0.08729552, 0.061420113, 0.026349582, 0.0778...
4 Product name = U-Live Tank\nDescription = Made... [-0.0651835, 0.04733741, 0.011253889, 0.062099...

Calculate Similarity¶

The most frequently employed method for assessing similarity is cosine similarity, or the cosine distance derived from it.
$\text{cosine-similarity}=S_{C}(A,B):=\cos(\theta)={\mathbf{A}\cdot\mathbf{B} \over \|\mathbf{A}\|\|\mathbf{B}\|}=\frac{\sum\limits_{i=1}^{n}A_{i}B_{i}}{\sqrt{\sum\limits_{i=1}^{n}A_{i}^{2}}\sqrt{\sum\limits_{i=1}^{n}B_{i}^{2}}}\in[-1,1]$
$\text{cosine-distance}=D_{C}(A,B):=1-S_{C}(A,B)$
ATTENTION: cosine distance is not a true distance metric, since it violates the triangle inequality.
We can leverage the pre-existing functionality provided by sklearn for this purpose.
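A minimal sketch of scoring with sklearn's `cosine_similarity`, using toy 2-dimensional vectors in place of the stored product embeddings:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy embeddings: each row is a product vector.
product_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query_vec = np.array([[1.0, 0.1]])

# One similarity score per product, each in [-1, 1].
scores = cosine_similarity(query_vec, product_vecs)[0]
print(scores.round(3))
```

The first product, pointing almost exactly along the query direction, scores highest; sorting these scores in descending order yields the ranked result list.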

In [22]:
fig.show()

Approximate calculation of similarity¶

With ANNOY (Approximate Nearest Neighbors Oh Yeah) we can significantly speed up our searches.
To achieve this, we build an index that is both fast to query and compact in memory.

In [24]:
ann_index: AnnoyIndex = get_annoy_index(df_for_search, n_trees=20)
In [27]:
fig_annoy.show()

Query with Annoy¶

In [29]:
query_easy = "hoodie"
sim_prod = df_products.loc[
    get_similar_products_annoy(ann_index, query_easy, 5), :
]
html_easy = helpers.display_images_and_names(
    sim_prod,
    merchant_id,
    f"ANNOY VectorSearch for:<br>'{query_easy}'",
)
In [30]:
display(HTML(html_easy))

ANNOY VectorSearch for:
'hoodie'

  • Women's 2Tone Hoodie in Blush-Pink
  • Women's Midnight Navy Hoodie
  • FREE GIFT | Men's Chill Pullover Hoodie in Black
  • FREE GIFT | Men's Chill Pullover Hoodie in Black
  • FREE GIFT | Men's Chill Pullover Hoodie in Black

Complex Query with Annoy¶

In [31]:
query_complex = "I Need a new hody for my Frau. It soll be green."
# I need a new hoodie for my wife. It should be green.
sim_prod = df_products.loc[
    get_similar_products_annoy(ann_index, query_complex, 5)[0:5]
]
html_complex = helpers.display_images_and_names(
    sim_prod,
    merchant_id,
    f"ANNOY VectorSearch for:<br>'{query_complex}'",
)
In [32]:
display(HTML(html_complex))

ANNOY VectorSearch for:
'I Need a new hody for my Frau. It soll be green.'

  • Women's Green Chill Hoodie Gift Box
  • Kinney Crew in Fern Green
  • UniCrew in Hunter Green
  • Iconic Blend Beanie in Beige Melange
  • Pine Green Beanie

Same query, same index: requesting more results¶

In [33]:
sim_prod = df_products.loc[
    get_similar_products_annoy(ann_index, query_complex, 500)[0:5]
]
html_complex_more_result = helpers.display_images_and_names(
    sim_prod,
    merchant_id,
    f"ANNOY VectorSearch for:<br>'{query_complex}'",
)
In [34]:
display(HTML(html_complex_more_result))

ANNOY VectorSearch for:
'I Need a new hody for my Frau. It soll be green.'

  • Women's Green Chill Hoodie Gift Box
  • Women's Green Chill Hoodie Gift Box
  • Kinney Crew in Fern Green
  • Retro UniCrew in Hunter Green
  • Kin Crew in Pacific Green

Same query, index with more trees: requesting few results¶

In [35]:
ann_index_more_tree: AnnoyIndex = get_annoy_index(
    df_for_search, n_trees=200
)
sim_prod = df_products.loc[
    get_similar_products_annoy(ann_index_more_tree, query_complex, 5)[
        0:5
    ]
]
html_complex_more_tree = helpers.display_images_and_names(
    sim_prod,
    merchant_id,
    f"ANNOY VectorSearch for:<br>'{query_complex}'",
)
In [36]:
display(HTML(html_complex_more_tree))

ANNOY VectorSearch for:
'I Need a new hody for my Frau. It soll be green.'

  • Women's Green Chill Hoodie Gift Box
  • Women's Green Chill Hoodie Gift Box
  • Kinney Crew in Fern Green
  • Kin Crew in Pacific Green
  • UniCrew in Hunter Green

Vector Search: Advantages and Disadvantages¶

Advantages¶

  1. Efficiency:
    • Vector search allows for fast and efficient similarity searches in high-dimensional spaces.
  2. Scalability:
    • Well-suited for large datasets and can scale effectively with the growing volume of data.
  3. Flexibility:
    • Adaptable to various data types, making it versatile for different domains such as image, text, and audio.
  4. Semantic Understanding:
    • Captures semantic relationships, enabling more meaningful and context-aware search results.

Disadvantages¶

  1. Complexity:
    • Implementation and optimization of vector search algorithms can be complex, requiring specialized knowledge.
  2. Resource Intensive:
    • Computationally intensive, demanding significant computing resources for large-scale applications.
  3. Quality of Embeddings:
    • The effectiveness of vector search heavily depends on the quality of the embeddings, which may require fine-tuning.
  4. Interpretability:
    • Results may lack interpretability, making it challenging to understand the reasoning behind specific search outcomes.

Time comparison¶